A SILENCE COMPRESSION SCHEME FOR G.729 OPTIMIZED FOR TERMINALS
CONFORMING TO RECOMMENDATION V.70

Summary

Annex B to G.729 defines a voice activity detector and comfort noise generator for use with G.729 or 
Annex A optimized for V.70 DSVD applications.

Recommendation G.729 - Annex B



A SILENCE COMPRESSION SCHEME FOR G.729 OPTIMIZED FOR TERMINALS 
CONFORMING TO RECOMMENDATION V.70 

(Geneva, 1996) 



B.1 Introduction

This annex provides a high level description of the Voice Activity Detection (VAD), Discontinuous 
Transmission (DTX), and Comfort Noise Generator (CNG) algorithms. These algorithms are used to 
reduce the transmission rate during silence periods of speech. They are designed and optimized to 
work in conjunction with Recommendation V.70. Recommendation V.70 mandates the use of 
Annex A/G.729 (G.729A) speech coding methods. However, when it is desirable, the full version of 
Recommendation G.729 can also be used to improve the quality of the speech. The algorithms are 
adapted to operate with both the full version of Recommendation G.729 and Annex A/G.729. This 
description is for the full version of Recommendation G.729; the only difference for Annex A is
indicated in B.3.1.1. A block diagram of a silence compression speech communication system is
depicted in Figure B.1.



[Figure: block diagram. In the speech encoder, the incoming speech feeds the VAD, which selects
either the active voice encoder or the non-active voice encoder; the resulting active or non-active
voice bit stream is sent over the communication channel. In the speech decoder, the VAD decision
selects the active voice decoder or the non-active voice decoder, which outputs the reconstructed
speech.]

FIGURE B.1/G.729
Speech communication system with VAD



B.2 General description of the VAD/DTX/CNG algorithms 

The VAD algorithm makes a voice activity decision every 10 ms in accordance with the frame size 
of the G.729 speech coder. A set of difference parameters is extracted and used for an initial 
decision. The parameters are the full band energy, the low band energy, the zero-crossing rate and a 
spectral measure. The long-term averages of the parameters during non-active voice segments follow 
the changing nature of the background noise. A set of differential parameters is obtained at each 
frame. These are a difference measure between each parameter and its respective long-term average.
The initial voice activity decision is obtained using a piecewise linear decision boundary between
each pair of differential parameters. A final voice activity decision is obtained by smoothing the 
initial decision. 

The output of the VAD module is either 1 or 0, indicating the presence or absence of voice activity 
respectively. If the VAD output is 1, the G.729 speech codec is invoked to code/decode the active 
voice frames. However, if the VAD output is 0, the DTX/CNG algorithms described herein are used 
to code/decode the non-active voice frames. Traditional speech coders and decoders use comfort 
noise to simulate the background noise in the non-active voice frames. If the background noise is not
stationary, a mere comfort noise insertion does not provide the naturalness of the original 
background noise. Therefore it is desirable to intermittently send some information about the 
background noise in order to obtain a better quality when non-active voice frames are detected. The 
coding efficiency of the non-active voice frames can be achieved by coding the energy of the frame 
and its spectrum with as few as fifteen bits. These bits are not automatically transmitted whenever 
there is a non-active voice detection. Rather, the bits are transmitted only when an appreciable 
change has been detected with respect to the last transmitted non-active voice frame. 

At the decoder side, the received bit stream is decoded. If the VAD output is 1, the G.729 decoder is 
invoked to synthesize the reconstructed active voice frames. If the VAD output is 0, the CNG 
module is called to reproduce the non-active voice frames.

B.3 Detailed description of the VAD algorithm

A flowchart of the VAD operation is given in Figure B.2. The VAD operates on frames of digitized 
speech. The frames are processed in time order and are consecutively numbered from the beginning 
of each conversation/recording. 

At the first stage, four parametric features are extracted from the input signal. Extraction of the 
parameters is shared with the active voice encoder module and the non-active voice encoder for 
computational efficiency. The parameters are the full and low-band frame energies, the set of Line
Spectral Frequencies (LSF) and the frame zero crossing rate. 

If the frame number is less than $N_i$, an initialization stage of the long-term averages takes place, and
the voice activity decision is forced to 1 if the frame energy from the LPC analysis is above 15 dB
(see equation B.1). Otherwise, the voice activity decision is forced to 0. If the frame number is equal
to $N_i$, an initialization stage for the characteristic energies of the background noise occurs.

At the next stage a set of difference parameters is calculated. This set is generated as a difference
measure between the current frame parameters and running averages of the background noise 
characteristics. Four difference measures are calculated: 

• a spectral distortion;

• an energy difference; 

• a low-band energy difference; 

• a zero-crossing difference. 

The initial voice activity decision is made at the next stage, using multi-boundary decision regions in 
the space of the four difference measures. The active voice decision is given as the union of the 
decision regions and the non-active voice decision is its complementary logical decision. Energy 
consideration, together with neighbouring past frames decisions, are used for decision smoothing. 

The running averages have to be updated only in the presence of background noise, and not in the 
presence of speech. An adaptive threshold is tested, and the update takes place only if the threshold 
criterion is met.

B.3.1 Parameter extraction

For each frame a set of parameters is extracted from the speech signal. The parameters extraction 
module can be shared between the VAD, the active voice encoder and the non-active voice encoder. 
The basic set of parameters is the set of autocorrelation coefficients, which is derived similarly to 
Recommendation G.729 (see 3.2.1). The set of autocorrelation coefficients will be denoted by: 

$\{R(i)\}_{i=0}^{q}$, where $q = 12$

B.3.1.1 Line Spectral Frequencies (LSF) 

A set of linear prediction coefficients is derived from the autocorrelation, and a set of $\{LSF_i\}_{i=1}^{p}$,
where $p = 10$, is derived from the set of linear prediction coefficients, as described in 3.2.3/G.729 or
A.3.2.3/G.729.

B.3.1.2 Full band energy

The full band energy $E_f$ is the logarithm of the normalized first autocorrelation coefficient $R(0)$:

$$E_f = 10 \log_{10}\left[\frac{1}{N} R(0)\right] \qquad \text{(B.1)}$$

where $N = 240$ is the LPC analysis window size in speech samples.

B.3.1.3 Low band energy 

The low band energy $E_l$, measured on the 0 to $F_l$ Hz band, is computed as follows:

$$E_l = 10 \log_{10}\left[\frac{1}{N} \mathbf{h}^T \mathbf{R} \mathbf{h}\right] \qquad \text{(B.2)}$$

where $\mathbf{h}$ is the impulse response of an FIR filter with cutoff frequency at $F_l$ Hz and $\mathbf{R}$ is the Toeplitz
autocorrelation matrix with the autocorrelation coefficients on each diagonal.

B.3.1.4 Zero crossing rate 

The normalized zero-crossing rate $ZC$ for each frame is calculated by:

$$ZC = \frac{1}{2M} \sum_{i=0}^{M-1} \left| \mathrm{sgn}[x(i)] - \mathrm{sgn}[x(i-1)] \right| \qquad \text{(B.3)}$$

where $\{x(i)\}$ is the pre-processed input signal (see 3.1/G.729) and $M = 80$.
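
As an illustration of equations B.1 and B.3, the following Python sketch computes the full band energy from $R(0)$ and the normalized zero-crossing rate of one frame. It is a floating-point illustration only: the function names are hypothetical, sgn(0) is assumed to be +1, and the previous frame's last sample is passed in explicitly as $x(-1)$. The normative description remains the bit-exact fixed-point C code.

```python
import math

N = 240  # LPC analysis window size in samples (B.3.1.2)
M = 80   # frame size in samples (B.3.1.4)


def full_band_energy(r0):
    """Equation B.1: E_f = 10*log10(R(0)/N), in dB."""
    return 10.0 * math.log10(r0 / N)


def sgn(v):
    # Assumption of this sketch: zero is treated as positive.
    return 1 if v >= 0 else -1


def zero_crossing_rate(x_prev, frame):
    """Equation B.3; x_prev is the last sample of the previous frame, x(-1)."""
    if len(frame) != M:
        raise ValueError("frame must hold M samples")
    total = 0
    prev = x_prev
    for cur in frame:
        total += abs(sgn(cur) - sgn(prev))  # 2 at each sign change, else 0
        prev = cur
    return total / (2.0 * M)
```

With this normalization a frame whose sign flips at every sample gives $ZC = 1$ and a frame with constant sign gives $ZC = 0$.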

B.3.2 Initialization of the running averages of the background noise characteristics 

For the first $N_i$ frames, the spectral parameters of the background noise, denoted by $\{\overline{LSF}_i\}_{i=1}^{p}$, are
initialized as an average of the $\{LSF_i\}_{i=1}^{p}$ of the frames. The average of the background noise
zero-crossings, denoted by $\overline{ZC}$, is initialized as an average of the zero crossing rate $ZC$ of the
frames.

The running averages of the background noise energy, denoted by $\overline{E}_f$, and the background noise
low-band energy, denoted by $\overline{E}_l$, are initialized as follows. First, the initialization procedure uses
$\overline{E}_n$, defined as the average of the frame energy $E_f$ over the first $N_i$ frames. These three averages
($\overline{E}_n$, $\overline{ZC}$ and $\{\overline{LSF}_i\}_{i=1}^{p}$) include only the frames that have an energy $E_f$ greater than 15 dB.
Second, the initialization procedure continues as follows:

if $\overline{E}_n < T_1$ then
    $\overline{E}_f = \overline{E}_n + K_0$
    $\overline{E}_l = \overline{E}_n + K_1$
else if $T_1 < \overline{E}_n < T_2$ then
    $\overline{E}_f = \overline{E}_n + K_2$
    $\overline{E}_l = \overline{E}_n + K_3$
else
    $\overline{E}_f = \overline{E}_n + K_4$
    $\overline{E}_l = \overline{E}_n + K_5$

See Table B.1 for constant values.

B.3.3 Generating the long-term minimum energy 

A long-term minimum energy parameter, $E_{min}$, is calculated as the minimum of $E_f$ over the $N_0$ previous
frames. Since $N_0$ is relatively large, $E_{min}$ is calculated using stored values of the minimum of $E_f$ over
short segments of the past.
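
One way to keep a minimum over a long history from stored short-segment minima can be sketched as follows. This is a hedged Python illustration, not the reference implementation: the segment length `SEG_LEN` and the class name are assumptions, and the window covered is between $N_0$ and $N_0$ + `SEG_LEN` - 1 frames rather than exactly $N_0$.

```python
from collections import deque

N0 = 128      # look-back length in frames (Table B.1)
SEG_LEN = 8   # assumed segment length; chosen so that it divides N0


class MinEnergyTracker:
    """Tracks an approximate minimum of E_f over roughly the last N0 frames."""

    def __init__(self):
        # Minima of the most recent completed segments (oldest dropped first).
        self.seg_mins = deque(maxlen=N0 // SEG_LEN)
        self.cur_min = float("inf")  # minimum of the segment in progress
        self.count = 0               # frames seen in the segment in progress

    def update(self, e_f):
        """Feed one frame energy (dB) and return the long-term minimum."""
        self.cur_min = min(self.cur_min, e_f)
        self.count += 1
        e_min = min([self.cur_min, *self.seg_mins])
        if self.count == SEG_LEN:
            # Close the segment: store its minimum and start a new one.
            self.seg_mins.append(self.cur_min)
            self.cur_min = float("inf")
            self.count = 0
        return e_min
```

Only `N0 // SEG_LEN` stored values are scanned per frame instead of $N_0$ frame energies, which is the saving the text alludes to.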

B.3.4 Generating the difference parameters 

Four difference measures are generated from the current frame parameters and the running averages 
of the background noise. 

B.3.4.1 The spectral distortion $\Delta S$

The spectral distortion measure is generated as the sum of squares of the difference between the
current frame $\{LSF_i\}_{i=1}^{p}$ vector and the running averages of the background noise $\{\overline{LSF}_i\}_{i=1}^{p}$:

$$\Delta S = \sum_{i=1}^{p} \left( LSF_i - \overline{LSF}_i \right)^2 \qquad \text{(B.4)}$$

B.3.4.2 The full-band energy difference $\Delta E_f$

The full-band energy difference measure is generated as the difference between the current frame
energy, $E_f$, and the running average of the background noise energy, $\overline{E}_f$:

$$\Delta E_f = \overline{E}_f - E_f \qquad \text{(B.5)}$$

B.3.4.3 The low-band energy difference $\Delta E_l$

The low-band energy difference measure is generated as the difference between the current frame
low-band energy, $E_l$, and the running average of the background noise low-band energy, $\overline{E}_l$:

$$\Delta E_l = \overline{E}_l - E_l \qquad \text{(B.6)}$$

B.3.4.4 The zero-crossing difference $\Delta ZC$

The zero-crossing difference measure is generated as the difference between the current frame
zero-crossing rate, $ZC$, and the running average of the background noise zero-crossing rate, $\overline{ZC}$:

$$\Delta ZC = \overline{ZC} - ZC \qquad \text{(B.7)}$$
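
The four difference measures can be computed together, as in the following Python sketch. The function and argument names are hypothetical, and the sign convention (running average minus current value for the energy and zero-crossing terms) is an assumption read from equations B.5 to B.7.

```python
def difference_parameters(lsf, e_f, e_l, zc, lsf_bar, e_f_bar, e_l_bar, zc_bar):
    """Return (delta_s, delta_ef, delta_el, delta_zc) for one frame.

    The *_bar arguments are the running averages of the background noise.
    """
    if len(lsf) != len(lsf_bar):
        raise ValueError("LSF vectors must have the same length")
    delta_s = sum((cur - avg) ** 2 for cur, avg in zip(lsf, lsf_bar))  # B.4
    delta_ef = e_f_bar - e_f  # B.5
    delta_el = e_l_bar - e_l  # B.6
    delta_zc = zc_bar - zc    # B.7
    return delta_s, delta_ef, delta_el, delta_zc
```

Under this convention a frame that is louder than the background yields negative energy differences.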

B.3.5 Multi-boundary initial voice activity decision 

The initial voice activity decision is denoted by $I_{VD}$, and is set to 0 ("FALSE") if the vector of
difference parameters lies within the non-active voice region. Otherwise, the initial voice activity
decision is set to 1 ("TRUE"). The fourteen boundary decisions in the four-dimensional space are
defined as follows:

1) if $\Delta S > a_1 \cdot \Delta ZC + b_1$ then $I_{VD} = 1$

2) if $\Delta S > a_2 \cdot \Delta ZC + b_2$ then $I_{VD} = 1$

3) if $\Delta E_f < a_3 \cdot \Delta ZC + b_3$ then $I_{VD} = 1$

4) if $\Delta E_f < a_4 \cdot \Delta ZC + b_4$ then $I_{VD} = 1$

5) if $\Delta E_f < b_5$ then $I_{VD} = 1$

6) if $\Delta E_f < a_6 \cdot \Delta S + b_6$ then $I_{VD} = 1$

7) if $\Delta S > b_7$ then $I_{VD} = 1$

8) if $\Delta E_l < a_8 \cdot \Delta ZC + b_8$ then $I_{VD} = 1$

9) if $\Delta E_l < a_9 \cdot \Delta ZC + b_9$ then $I_{VD} = 1$

10) if $\Delta E_l < b_{10}$ then $I_{VD} = 1$

11) if $\Delta E_l < a_{11} \cdot \Delta S + b_{11}$ then $I_{VD} = 1$

12) if $\Delta E_l > a_{12} \cdot \Delta E_f + b_{12}$ then $I_{VD} = 1$

13) if $\Delta E_l < a_{13} \cdot \Delta E_f + b_{13}$ then $I_{VD} = 1$

14) if $\Delta E_l < a_{14} \cdot \Delta E_f + b_{14}$ then $I_{VD} = 1$

If none of the fourteen conditions is "TRUE", $I_{VD} = 0$. See Table B.1 for constant values.
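
The union of the fourteen half-space tests can be sketched in Python as below. The function name is hypothetical, and the coefficients are passed in as floating-point values expressed in the same units as the difference measures; the constants of Table B.1 are fixed-point values, and converting them to such units is outside the scope of this sketch.

```python
def initial_vad_decision(d_s, d_ef, d_el, d_zc, a, b):
    """Return I_VD (1 or 0) from the four difference measures.

    a and b map the condition index (1..14) to its slope and offset;
    slopes of conditions 5, 7 and 10 are unused.
    """
    conditions = (
        d_s > a[1] * d_zc + b[1],
        d_s > a[2] * d_zc + b[2],
        d_ef < a[3] * d_zc + b[3],
        d_ef < a[4] * d_zc + b[4],
        d_ef < b[5],
        d_ef < a[6] * d_s + b[6],
        d_s > b[7],
        d_el < a[8] * d_zc + b[8],
        d_el < a[9] * d_zc + b[9],
        d_el < b[10],
        d_el < a[11] * d_s + b[11],
        d_el > a[12] * d_ef + b[12],
        d_el < a[13] * d_ef + b[13],
        d_el < a[14] * d_ef + b[14],
    )
    # Active voice is the union of the fourteen decision regions.
    return 1 if any(conditions) else 0
```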
TABLE B.1/G.729
Table of constants

Name        Constant        Name        Constant
$N_i$       32              $N_1$       4
$N_0$       128             $N_2$       10
$K_0$       0               $T_1$       671088640
$K_1$       -53687091       $T_2$       738197504
$K_2$       -67108864       $T_3$       26843546
$K_3$       -93952410       $T_4$       40265318
$K_4$       -134217728      $T_5$       40265318
$K_5$       -161061274      $T_6$       40265318
$a_1$       23488           $b_1$       28521
$a_2$       -30504          $b_2$       19446
$a_3$       -32768          $b_3$       -32768
$a_4$       26214           $b_4$       -19661
$a_5$       0               $b_5$       -30802
$a_6$       28160           $b_6$       -19661
$a_7$       0               $b_7$       30199
$a_8$       16384           $b_8$       -22938
$a_9$       -19065          $b_9$       -31576
$a_{10}$    0               $b_{10}$    -17367
$a_{11}$    22400           $b_{11}$    -27034
$a_{12}$    30427           $b_{12}$    29959
$a_{13}$    -24576          $b_{13}$    -29491
$a_{14}$    23406           $b_{14}$    -28087


B.3.6 Voice activity decision smoothing 

The initial voice activity decision is smoothed (hangover) to reflect the long-term stationary nature
of the speech signal. The smoothing is done in four stages.

A flag indicating that hangover has occurred is defined as v_flag. It is set to zero each time before
the voice activity decision smoothing is performed. Denote the smoothed voice activity decision of
the frame, the previous frame and the frame before the previous frame by $S_{VD}^{0}$, $S_{VD}^{-1}$ and $S_{VD}^{-2}$,
respectively. $S_{VD}^{-1}$ is initialized to 1, and $S_{VD}^{-2}$ is initialized to 1. For start $S_{VD}^{0} = I_{VD}$. The first
smoothing stage is:

if ($I_{VD} = 0$) and ($S_{VD}^{-1} = 1$) and ($E_f > \overline{E}_f + T_3$) then $S_{VD}^{0} = 1$ and v_flag = 1

For the second smoothing stage define a Boolean parameter $F_{VD}^{-1}$ and a smoothing counter $C_e$.
$F_{VD}^{-1}$ is initialized to 1 and $C_e$ is initialized to 0. Denote the energy of the previous frame by $E_{-1}$. The
second smoothing stage is:

if ($F_{VD}^{-1} = 1$) and ($I_{VD} = 0$) and ($S_{VD}^{-1} = 1$) and ($S_{VD}^{-2} = 1$) and ($|E_f - E_{-1}| \le T_4$) {
    $S_{VD}^{0} = 1$
    v_flag = 1
    $C_e = C_e + 1$
    if ($C_e \le N_1$) {
        $F_{VD}^{-1} = 1$
    }
    else {
        $F_{VD}^{-1} = 0$
        $C_e = 0$
    }
}
else
    $F_{VD}^{-1} = 1$

For the third smoothing stage define a noise continuity counter $C_s$, which is initialized to 0. If
$S_{VD}^{0} = 0$ then $C_s$ is incremented. The third smoothing stage is:

if ($S_{VD}^{0} = 1$) and ($C_s > N_2$) and ($E_f - E_{-1} \le T_5$) {
    $S_{VD}^{0} = 0$
    $C_s = 0$
}
if ($S_{VD}^{0} = 1$) $C_s = 0$

In the fourth stage, a voice activity decision is made if the following condition is satisfied:

if (($E_f < \overline{E}_f + T_6$) and (frm_count > $N_0$) and (v_flag = 0)) then $S_{VD}^{0} = 0$

B.3.7 Updating the running averages of the background noise characteristics 

The running averages of the background noise characteristics are updated at the last stage of the
VAD module. At this stage, the following condition is tested and the updating takes place if it is met:

if ($E_f < \overline{E}_f + T_6$) then update

The running averages of the background noise characteristics are updated using a first order
Auto-regressive (AR) scheme. Different AR coefficients are used for different parameters, and
different sets of coefficients are used at the beginning of the recording/conversation or when a large
change of the noise characteristics is detected.

Let $\beta_{E_f}$ be the AR coefficient for the update of $\overline{E}_f$, $\beta_{E_l}$ be the AR coefficient for the update of
$\overline{E}_l$, $\beta_{ZC}$ be the AR coefficient for the update of $\overline{ZC}$ and $\beta_{LSF}$ be the AR coefficient for the update
of $\{\overline{LSF}_i\}_{i=1}^{p}$. The total number of frames where the update condition was satisfied is counted by $C_n$.
A different set of the coefficients $\beta_{E_f}$, $\beta_{E_l}$, $\beta_{ZC}$, $\beta_{LSF}$ is used according to the value of $C_n$.

The AR update is done according to:

$$\begin{aligned}
\overline{E}_f &= \beta_{E_f} \cdot \overline{E}_f + (1 - \beta_{E_f}) \cdot E_f \\
\overline{E}_l &= \beta_{E_l} \cdot \overline{E}_l + (1 - \beta_{E_l}) \cdot E_l \\
\overline{ZC} &= \beta_{ZC} \cdot \overline{ZC} + (1 - \beta_{ZC}) \cdot ZC \\
\overline{LSF}_i &= \beta_{LSF} \cdot \overline{LSF}_i + (1 - \beta_{LSF}) \cdot LSF_i, \quad i = 1, \ldots, p
\end{aligned} \qquad \text{(B.8)}$$

$\overline{E}_f$ and $C_n$ are further updated according to:

if (frm_count > $N_0$) and ($\overline{E}_f < E_{min}$) {
    $\overline{E}_f = E_{min}$
    $C_n = 0$
}
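
The conditional first-order AR update can be illustrated with the Python sketch below. The dictionary layout, the function names and the `margin_db` parameter are assumptions of this example; the actual AR coefficients, which depend on the update counter, are not listed in the text and must be supplied by the caller.

```python
def ar_update(average, current, beta):
    """First-order AR update: beta * average + (1 - beta) * current."""
    return beta * average + (1.0 - beta) * current


def update_background(bg, frame, betas, margin_db):
    """Return (background, updated_flag) after processing one frame.

    bg and frame hold the keys "e_f", "e_l", "zc" (floats) and "lsf" (list).
    betas holds one AR coefficient per key; margin_db is the update margin.
    """
    # The update only takes place when the frame energy stays close to the
    # background level, i.e. when the frame is presumed to be noise.
    if not frame["e_f"] < bg["e_f"] + margin_db:
        return bg, False
    new_bg = {k: ar_update(bg[k], frame[k], betas[k]) for k in ("e_f", "e_l", "zc")}
    new_bg["lsf"] = [
        ar_update(avg, cur, betas["lsf"]) for avg, cur in zip(bg["lsf"], frame["lsf"])
    ]
    return new_bg, True
```

A coefficient close to 1 makes the running average follow the noise slowly; a smaller value lets it adapt quickly, which is useful at the start of a recording.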

B.4 Detailed description of the DTX/CNG algorithms 

The DTX/CNG algorithms provide continuous and smooth information about the non-active voice 
periods, while keeping a low average bit rate. 

B.4.1 Description of the DTX algorithm 

For each non-active voice frame, the DTX module decides if a set of non-active voice update 
parameters ought to be sent to the speech decoder, by measuring the changes in the non-active voice 
signal. Absolute and adaptive thresholds on the frame energy and the spectral distortion measure are
used to obtain the update decision. If an update is needed, the non-active voice encoder sends the 
information needed to generate a signal which is perceptually similar to the original non-active voice 
signal. This information is comprised of an energy level and a description of the spectral envelope. If 
no update is needed, the non-active voice signal is generated by the non-active decoder according to 
the last received energy and spectral shape information of a non-active voice frame. 

However, a minimum interval of $N_{min} = 2$ frames is required between two consecutive SID frames, i.e.
if a spectral or level change has occurred $n < N_{min}$ frames after a SID frame, the SID emission is
delayed.

Situated at the transmitting end, the DTX module receives from the VAD module the
active/non-active voice information, and from the encoder modules the autocorrelation function of
the speech signal computed for each 80 sample frame and the past excitation samples. For each frame,
the DTX decision $Ftyp_t$ (frame type for frame numbered $t$) is output as one of the three values 0, 1
or 2, corresponding to untransmitted frame, active speech frame or SID frame, respectively, according
to the following procedure:

B.4.1.1 Store the frame autocorrelation function

For every frame $t$ (active or inactive), the autocorrelation coefficients of the current frame $t$,
including the bandwidth expansion and noise correction (see the G.729 description), are retained in
memory. The set of frame $t$ autocorrelations will be denoted $r'_t(j)$, for $j = 0$ to 10.

B.4.1.2 Computation of the current frame type 

If the current frame $t$ is an active speech frame ($Vad_t = 1$), then the current frame type $Ftyp_t = 1$ and
the normal speech encoder processing continues.

In the other case, a current LPC filter $A_t(z)$ calculated over $N_{cur} = 2$ previous frames including the
current one $t$ is first evaluated:

The $N_{cur}$ autocorrelation functions are summed:

$$R^t(j) = \sum_{i=t-N_{cur}+1}^{t} r'_i(j), \quad j = 0 \text{ to } 10 \qquad \text{(B.9)}$$

and $A_t(z)$ is calculated by the Levinson-Durbin procedure (see the G.729 description) using $R^t(j)$
as input. The coefficients of this filter will be noted $a_t(j)$, $j = 0$ to 10. The Levinson-Durbin procedure
also provides the residual energy $E_t$, that will be rescaled and used as an estimate of the frame
excitation energy.

Then the current frame type $Ftyp_t$ is determined in the following way:

• If the current frame is the first inactive frame of the inactive zone, the frame is selected as
SID frame. The variable $\overline{E}$ which reflects the energy sum is taken equal to $E_t$ and the
number of frames involved in the summation, $k_E$, is initialized to 1:

$$(Vad_{t-1} = 1) \Rightarrow \begin{cases} Ftyp_t = 2 \\ \overline{E} = E_t \\ k_E = 1 \end{cases} \qquad \text{(B.10)}$$

• For the other frames, the algorithm compares the preceding SID parameters to the current
ones: if the current filter is significantly different from the preceding SID filter, or if the current
excitation energy significantly differs from the preceding SID energy, the flag flag_chang is
set to 1, else it does not change.

• The counter count_fr, indicating how many frames have elapsed since the previous SID frame,
is incremented. If its value is greater than $N_{min}$, the emission of a SID frame is allowed. Then
if flag_chang is equal to 1, a SID frame is sent. In all other cases, the current frame is
untransmitted:

$$\left.\begin{array}{l} count\_fr \ge N_{min} \\ flag\_chang = 1 \end{array}\right\} \Rightarrow Ftyp_t = 2 \qquad \text{(B.11)}$$

Otherwise: $Ftyp_t = 0$

In case of a SID frame, the counter count_fr and the flag flag_chang are re-initialized to 0.
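
The frame type decision can be read as a small state machine, sketched below in Python. The class and function names are hypothetical, the result of the filter and energy comparisons is passed in as a single boolean `changed`, and the test `count_fr >= N_MIN` follows the inequality of equation B.11 as an assumption.

```python
N_MIN = 2  # minimum interval between two consecutive SID frames


class DtxState:
    def __init__(self):
        self.prev_vad = 1    # VAD decision of the previous frame
        self.count_fr = 0    # frames elapsed since the previous SID frame
        self.flag_chang = 0  # set when a significant change has been seen


def dtx_frame_type(state, vad, changed):
    """Return Ftyp: 0 = untransmitted, 1 = active speech, 2 = SID."""
    if vad == 1:
        ftyp = 1
    elif state.prev_vad == 1:
        ftyp = 2  # first inactive frame of the inactive zone (B.10)
    else:
        if changed:
            state.flag_chang = 1  # the flag is sticky until a SID is sent
        state.count_fr += 1
        if state.count_fr >= N_MIN and state.flag_chang == 1:  # B.11
            ftyp = 2
        else:
            ftyp = 0
    if ftyp == 2:
        state.count_fr = 0
        state.flag_chang = 0
    state.prev_vad = vad
    return ftyp
```

Because the flag stays set until a SID frame is emitted, a change detected too soon after a SID frame delays the update instead of discarding it.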
LPC filters and energies are compared according to the following methods: 
B.4.1.3 Comparison of the LPC filters

The previous SID LPC filter will be noted $A_{sid}(z)$ and its coefficients $a_{sid}(j)$, $j = 0$ to 10 (the
evaluation of this filter is described in B.4.2.2). The current and previous SID-LPC filters are
considered as significantly different if the Itakura distance between the two filters exceeds a given
threshold, which is expressed by:

$$\sum_{j=0}^{10} R_a(j) \times R^t(j) > E_t \times thr1 \qquad \text{(B.12)}$$

where $R_a(j)$, $j = 0$ to 10, is a function derived from the autocorrelation of the coefficients of the SID
filter, given by:

$$\begin{aligned}
R_a(j) &= 2 \sum_{k=0}^{10-j} a_{sid}(k) \times a_{sid}(k+j) \quad \text{if } j \ne 0 \\
R_a(0) &= \sum_{k=0}^{10} a_{sid}(k)^2
\end{aligned} \qquad \text{(B.13)}$$

A value of 1.20226 is used for thr1.

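
The comparison of B.12 and B.13 can be sketched in Python as follows. The function names are hypothetical. The left-hand side of B.12 is the energy of the current signal after filtering by the previous SID filter, so its ratio to the current residual energy is 1 when both filters coincide and grows as they diverge.

```python
THR1 = 1.20226


def coeff_autocorrelation(a):
    """Equation B.13: R_a(j) for a filter with coefficients a[0..order]."""
    order = len(a) - 1
    r_a = [sum(c * c for c in a)]
    for j in range(1, order + 1):
        r_a.append(2.0 * sum(a[k] * a[k + j] for k in range(order - j + 1)))
    return r_a


def filters_differ(a_sid, r_cur, e_cur, thr=THR1):
    """Equation B.12: True if the current filter differs significantly.

    a_sid: coefficients of the previous SID filter (a_sid[0] == 1).
    r_cur: summed autocorrelations R^t(j) of the current frames.
    e_cur: residual energy E_t from the Levinson-Durbin procedure.
    """
    r_a = coeff_autocorrelation(a_sid)
    return sum(x * y for x, y in zip(r_a, r_cur)) > e_cur * thr
```

For a first-order example with $R = (1, 0.5)$, the optimal filter is $(1, -0.5)$ with residual energy 0.75; comparing that filter with itself gives a left-hand side of exactly 0.75, below the threshold, while the filter $(1, 0.5)$ gives 1.75 and is flagged as different.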
B.4.1.4 Comparison of the energies 

The sum of the frame energies $\overline{E}$ is calculated, $k_E$ being first incremented up to the maximum value
$N_E = 2$:

$$\overline{E} = \sum_{i=t-k_E+1}^{t} E_i \qquad \text{(B.14)}$$

Then $\overline{E}$ is quantized, using the 5-bit logarithmic quantizer described in B.4.2.1. The decoded
log-energy $E_q$ is compared to the previous decoded SID log-energy $E_q^{sid}$. If the difference exceeds the
threshold thr2 = 2 dB, the two energies will be considered as significantly different.

B.4.2 SID evaluation and quantization 

The Silence Insertion Descriptor (SID) is comprised of the quantized frame excitation energy (i.e. the
current quantized excitation energy $Q(E)$ for the SID frames) and the quantized LSPs corresponding
to the estimated SID-LPC filter. Four indices make up the SID frame. One index describes the energy
and three indices describe the spectrum portion of the SID frame.

B.4.2.1 Energy quantization

The quantization of the energy $\overline{E}$ is performed as follows. First, a scaling factor $\alpha_w = 0.125$ is
introduced that takes into account the effect of windowing and bandwidth expansions present in the
subframe autocorrelation functions $r'(j)$.

The value used at the input of the gain quantizer is: 

$$E' = \alpha_w \times \frac{\overline{E}}{k_E} \qquad \text{(B.15)}$$

The energy term $E'$ is quantized with a 5-bit non-uniform quantizer in the logarithmic domain in the
range of -12 dB to 66 dB. A uniform step size of 2 dB is used between 16 dB and 66 dB. A step size
of 4 dB is used in the range of -4 dB to 16 dB. Below -4 dB, a single step size of 8 dB is used, giving
a quantization level of -12 dB. The quantization is straightforward and does not need the storage of a
quantizer table.

Notice that since the energy comparison (B.4.1.4) is performed with decoded energies, the 
quantization of the energy is done for all non-active voice frames. 
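
The level layout described above can be checked with a short Python sketch: one level at -12 dB, five levels from -4 dB to 12 dB in 4 dB steps and twenty-six levels from 16 dB to 66 dB in 2 dB steps give exactly 32 levels, i.e. 5 bits. The nearest-level decision rule and the explicit level list are assumptions made for readability; the reference code derives the index arithmetically and may place its decision thresholds differently.

```python
# Reconstruction levels in dB, in increasing order.
LEVELS_DB = (
    [-12.0]
    + [float(v) for v in range(-4, 16, 4)]   # -4, 0, 4, 8, 12
    + [float(v) for v in range(16, 67, 2)]   # 16, 18, ..., 66
)


def quantize_energy_db(e_db):
    """Return (index, decoded level in dB) for a log-energy given in dB."""
    idx = min(range(len(LEVELS_DB)), key=lambda i: abs(LEVELS_DB[i] - e_db))
    return idx, LEVELS_DB[idx]
```

Inputs outside the range are clamped by construction, since the nearest level to a very small or very large value is the first or last entry.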

B.4.2.2 SID-LPC filter estimation and quantization 

The SID-LPC filter estimation takes into account the local stationarity or non-stationarity of the noise
in the neighbourhood of the SID frame.

First, a past average filter $A_p(z)$, built from the $N_p$ frames preceding the current SID one, is calculated,
using the following autocorrelation sum as input of the Levinson-Durbin procedure:

$$R^p(j) = \sum_{k=t'-N_p+1}^{t'} r'_k(j), \quad j = 0 \text{ to } 10 \qquad \text{(B.16)}$$

The number of frames involved in the summation has been fixed to $N_p = 6$.

The frame number $t'$ varies in $[t - 1, t - N_{cur}]$, depending on the remainder of the Euclidean division of the
current frame number $t$ by $N_{cur}$.

The SID-LPC filter is then obtained with:

$$A_{sid}(z) = \begin{cases} A_t(z) & \text{if distance}\left(A_t(z), A_p(z)\right) > thr3 \\ A_p(z) & \text{otherwise} \end{cases} \qquad \text{(B.17)}$$

The threshold value thr3 is fixed to 1.12202 and the distance between the current LPC filter and the
past average one is calculated in the same manner as in B.4.1.3 (see equation B.12).

Then the SID-LPC filter is transformed to the LSF domain for quantization. The LSFs are quantized
by a two-stage switched predictive vector quantization ("VQ") with 5 and 4 bits each. The 
quantization of the LSF vector entails the determination of the best three indices. The first index is 
that of the predictor. The last two indices are each taken from a different vector table, as it is done in 
a two stage vector quantization. The overall quantization procedure follows the one given in 
3.2.4/G.729 with the following modifications: 

1) The second 4th order MA predictor used in Recommendation G.729 is modified as a linear 
combination of the first and second MA predictors as follows: 

$$P_{i,k,2} = 0.6\, p_{i,k,1} + 0.4\, p_{i,k,2} \qquad \text{(B.18)}$$

where

$i = 1, \ldots, 10$, $k = 1, \ldots, 4$

2) The first stage VQ quantization is similar to the one used in Recommendation G.729. 
However, only a portion of the first table of the quantizer is used. The relevant subset entries 
of the table are stored in an auxiliary lookup table with 32 address indices. Moreover, a 
delayed decision quantization is used by keeping a few candidates as inputs to the second
stage.

3) The candidates from the first stage in conjunction with those of the second stage are used by
the second stage VQ. The second stage VQ quantization is different from the one used in
Recommendation G.729. A full VQ is used as compared to the split VQ of
Recommendation G.729. Only a portion of the second stage tables is used as well. The
relevant subset entries are stored in another lookup table with two 16 address entries. The
combination of the predictor, a vector from the first stage and a vector from the second stage,
leading to the minimum distortion in the weighted mean square error sense, is chosen as the
LSF descriptor.

B.4.3 SID bit stream description

The bit stream related to the transmission of an SID frame is described in Table B.2. The bit stream
related to the transmission of an active frame is defined in Table 8/G.729. The bit stream ordering is
reflected by the order in the table. For each parameter the Most Significant Bit (MSB) is transmitted
first.



TABLE B.2/G.729

Parameter description                        Bits
Switched predictor index of LSF quantizer    1
First stage vector of LSF quantizer          5
Second stage vector of LSF quantizer         4
Gain (Energy)                                5
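
The fifteen-bit SID payload can be serialized in table order with the MSB of each parameter first, as in the Python sketch below. The field names and the representation of the bit stream as a list of 0/1 integers are assumptions of this example, not part of the Recommendation.

```python
# (field name, width in bits), in the transmission order of Table B.2.
FIELDS = (
    ("lsf_predictor", 1),
    ("lsf_stage1", 5),
    ("lsf_stage2", 4),
    ("gain", 5),
)


def pack_sid(values):
    """Return the SID frame as a list of 15 bits, MSB of each field first."""
    bits = []
    for name, width in FIELDS:
        v = values[name]
        if not 0 <= v < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        bits.extend((v >> shift) & 1 for shift in range(width - 1, -1, -1))
    return bits


def unpack_sid(bits):
    """Inverse of pack_sid: return the four indices as a dict."""
    if len(bits) != sum(width for _, width in FIELDS):
        raise ValueError("an SID frame carries exactly 15 bits")
    values, pos = {}, 0
    for name, width in FIELDS:
        v = 0
        for bit in bits[pos:pos + width]:
            v = (v << 1) | bit
        values[name] = v
        pos += width
    return values
```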



B.4.4 Non-active encoder/decoder (CNG) description 

At the decoder part, the comfort noise is generated by introducing a pseudo-white excitation signal of
controlled level into interpolated LPC filters, in the same manner as the decoder produces active
speech by filtering the decoded excitation. The excitation level and LPC filters are obtained from the
previous SID information. The subframe interpolated LPC filters are obtained by using the
SID-LSPs as current LSPs and performing the interpolation with the previous frame LSPs as done
for active frames in Recommendation G.729.

The pseudo-white excitation $ex(n)$ is a mixture between an excitation of the same type as the active
speech one, $ex_1(n)$, and a white Gaussian excitation $ex_2(n)$.

The G.729 excitation $ex_1(n)$ is composed of an adaptive excitation with a small gain and an ACELP
fixed excitation, which improves the transition between active and non-active voice frames. The
addition of a Gaussian excitation $ex_2(n)$ allows the generation of a whiter signal.

Since the encoder and decoder need to keep synchronized during non-active voice periods, the 
excitation generation is performed on both sides, for SID frames and for untransmitted frames. 

First, let us define the target excitation gain $G_t$ as the square root of the average energy that must be
obtained for the current frame $t$ synthetic excitation. $G_t$ is calculated using the following smoothing
procedure, where $G_{sid}$ is the SID gain derived from the decoded SID gain:

$$G_t = \begin{cases} G_{sid} & \text{if } Vad_{t-1} = 1 \\ \frac{7}{8} G_{t-1} + \frac{1}{8} G_{sid} & \text{otherwise} \end{cases} \qquad \text{(B.19)}$$



The 80 samples of the frame are divided into 2 subframes of 40 samples. For each subframe, the 
CNG excitation samples are synthesized using the following algorithm. 



A pitch lag is randomly chosen in the interval [40,103].
Next, the fixed codebook vector of the subframe is built by random selection of the grid, the pulse
signs and positions, according to the G.729 ACELP code structure.

An adaptive excitation signal of unity gain is then calculated, noted $e_a(n)$, $n = 0$ to 39. The selected
subframe fixed excitation will be noted $e_f(n)$, $n = 0$ to 39.

The adaptive and fixed gains $G_a$ and $G_f$ are then computed in order to yield a subframe average
energy equal to $G_t^2$, which is expressed by:

$$\frac{1}{40} \sum_{n=0}^{39} \left( G_a \times e_a(n) + G_f \times e_f(n) \right)^2 = G_t^2 \qquad \text{(B.20)}$$

Notice that $G_f$ can take a negative value.

Let us define $E_a = \sum_{n=0}^{39} e_a(n)^2$, $I = \sum_{n=0}^{39} e_a(n)\, e_f(n)$ and $K = 40 \times G_t^2$.

Due to the ACELP excitation structure, $\sum_{n=0}^{39} e_f(n)^2 = 4$.

If we fix randomly the adaptive gain $G_a$, then equation B.20 becomes a second order equation on the
fixed gain $G_f$:

$$G_f^2 + \frac{G_a \times I}{2}\, G_f + \frac{E_a \times G_a^2 - K}{4} = 0 \qquad \text{(B.21)}$$

A constraint may be imposed on $G_a$ to be sure that this equation has a solution. Furthermore it is
desirable to forbid the use of large adaptive gains. For this, the adaptive gain $G_a$ will be randomly
chosen in:

$$\left[ 0,\; \mathrm{Max}\left\{ 0.5,\; \sqrt{\frac{K}{A}} \right\} \right], \quad \text{with } A = E_a - I^2/4 \qquad \text{(B.22)}$$

The root of equation B.21 that has the lowest absolute value is selected for $G_f$.
Finally the G.729 excitation is built, using:

$$ex_1(n) = G_a \times e_a(n) + G_f \times e_f(n), \quad n = 0 \text{ to } 39 \qquad \text{(B.23)}$$

The method of deriving the composite excitation signal $ex(n)$ is as follows:

Let $E_1$ be the energy of $ex_1(n)$ and $E_2$ be the energy of $ex_2(n)$. $ex_2(n)$ has a unit variance and a zero
mean. Let $E_3$ be the cross-energy between $ex_1(n)$ and $ex_2(n)$.

$$\begin{aligned}
E_1 &= \sum ex_1^2(n) \\
E_2 &= \sum ex_2^2(n) \\
E_3 &= \sum ex_1(n) \cdot ex_2(n)
\end{aligned} \qquad \text{(B.24)}$$

where the summation is over the subframe size.

Let $\alpha$ and $\beta$ be the scale proportions of $ex_1(n)$ and $ex_2(n)$ used in the mixture excitation, respectively.
$\alpha$ is set to 0.6. $\beta$ is found as the solution to the following quadratic equation:

$$\beta^2 E_2 + 2 \alpha \beta E_3 + (\alpha^2 - 1) E_1 = 0, \quad \text{with } \beta > 0 \qquad \text{(B.25)}$$

If no solution is found for $\beta$, it is set to 0 and $\alpha$ to 1.
The CNG excitation $ex(n)$ becomes:

$$ex(n) = \alpha\, ex_1(n) + \beta\, ex_2(n) \qquad \text{(B.26)}$$
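
The two quadratic solutions used in this clause can be sketched in Python as follows. The function names are hypothetical. The fixed-gain quadratic is obtained by expanding the subframe energy constraint with $\sum e_f(n)^2 = 4$; the mixture quadratic expresses that the energy of $\alpha\, ex_1(n) + \beta\, ex_2(n)$ equals $E_1$. The random draws of the pitch lag, the pulses and $G_a$ are not modelled here.

```python
import math

ALPHA = 0.6


def solve_fixed_gain(g_a, e_a, inter, k):
    """Return the root G_f of smallest absolute value, or None if no real root.

    Solves G_f^2 + (g_a*inter/2)*G_f + (e_a*g_a^2 - k)/4 = 0.
    """
    b = g_a * inter / 2.0
    c = (e_a * g_a * g_a - k) / 4.0
    disc = b * b - 4.0 * c
    if disc < 0.0:
        return None
    root = math.sqrt(disc)
    r1, r2 = (-b + root) / 2.0, (-b - root) / 2.0
    return r1 if abs(r1) <= abs(r2) else r2


def mixture_gains(ex1, ex2, alpha=ALPHA):
    """Return (alpha, beta) such that alpha*ex1 + beta*ex2 keeps the energy of ex1."""
    e1 = sum(v * v for v in ex1)
    e2 = sum(v * v for v in ex2)
    e3 = sum(u * v for u, v in zip(ex1, ex2))
    disc = (2.0 * alpha * e3) ** 2 - 4.0 * e2 * (alpha * alpha - 1.0) * e1
    if e2 <= 0.0 or disc < 0.0:
        return 1.0, 0.0  # fallback: no solution found
    beta = (-2.0 * alpha * e3 + math.sqrt(disc)) / (2.0 * e2)
    if beta <= 0.0:
        return 1.0, 0.0
    return alpha, beta
```

For two orthogonal unit-energy excitations the mixture gains are 0.6 and 0.8, since $0.6^2 + 0.8^2 = 1$.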

B.4.5 Frame erasure concealment with regards to the CNG 

When a frame erasure is detected by the decoder, the erased frame type depends on the preceding 
frame type: 

- if the preceding frame was active, then the current frame is considered as active; 

- else if the preceding frame was either a SID frame or an untransmitted frame, the current 
erased frame is considered as untransmitted.

If an untransmitted frame has been erased, no error is then introduced. 
If a SID frame is erased, there are two possibilities: 

• If it is not the first SID frame of the current inactive period, then the previous SID 
parameters are kept. 

• If it is the first SID frame of an inactive period, a special protection is applied.
Notice first that this case is detected by the fact that $Ftyp_{t-1} = 1$ and $Ftyp_t = 0$.

This combination of events does not imply that the preceding frame was a good active frame: several
frames up to the preceding one may have been erased. What is certain is that the last good frame was
an active frame, that the present frame was not erased, and that the SID frame supposed to provide
information for the current untransmitted frame is lost.

To recover the SID information, the CNG module uses parameters provided by the G.729 decoder
main part:

• the LSPs of the last valid active frame are used for the SID-LPC filter;

• an energy term is calculated on the excitation signal by the decoder during the processing of
all valid active voice frames. To recover the missing SID gain $G_{sid}$, the energy term of the
last valid active frame is quantized with the SID gain quantizer and decoded.

Finally to avoid de-synchronization of the random generator used to compute the excitation, the 
pseudo-random sequence reset is performed at each active frame, both at the encoder and decoder 
parts. 

B.5 Bit-exact description of the silence compression scheme

The silence compression scheme is simulated in 16-bit fixed-point ANSI-C code using the same set
of fixed-point basic operators defined in Table 11/G.729. The ANSI-C code constitutes an integral
part of this Recommendation reflecting the bit-exact, fixed-point description of the silence
compression scheme. In the event of any discrepancy between the printed text of this
Recommendation and the C source, the C-source code is presumed to be correct.

B.5.1 Organization of the simulation software

Same as 5.2/G.729.

The Annex B ANSI-C software modules are listed in Table B.3. Refer to the read.me file provided
with the software for more details. 



TABLE B.3/G.729

G.729 Annex B ANSI-C module names     Description
Vad.c                                 VAD
Dtx.c                                 DTX Decision
Qsidgain.c                            SID Gain Quantization
QsidLSF.c                             SID-LSF Quantization
Calcexc.c                             CNG Excitation Calculation
Dec_sid.c                             Decode SID Information
Miscel.c                              Miscellaneous Calculations

G.729 Annex B ANSI-C .h file names    Description
Vad.h                                 Prototype and Constants
Dtx.h                                 Prototype and Constants
Sid.h                                 Prototype and Constants
Miscel.h                              Prototype and Constants



